First download the data you need, then check its format to decide how it should be processed.
import pandas as pd
# Read the data
melbourne_file_path = './Dataset/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Select the target and the features
# (filtered_melbourne_data is the DataFrame after dropping missing rows; see below)
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]
# List every distinct value in a column (ks here is a separate example DataFrame)
print('Unique values in `state` column:', list(ks.state.unique()))
Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']
`get_dummies` automatically converts a DataFrame's categorical columns into one-hot encoding, so the model can consume them:
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
Original data:
Processed data:
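To make the before/after concrete, here is a minimal sketch on a toy frame (the values are made up; only the column names mirror the Titanic-style features above):

```python
import pandas as pd

# Toy frame standing in for train_data[features] (hypothetical rows)
df = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "female"]})

encoded = pd.get_dummies(df)
# The categorical "Sex" column expands into one indicator column per
# category, while numeric columns pass through unchanged
print(list(encoded.columns))  # ['Pclass', 'Sex_female', 'Sex_male']
```

Each indicator column holds 1 (or True, in recent pandas versions) where the row belongs to that category, and 0 otherwise.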
# Drop rows that contain missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
from sklearn.model_selection import train_test_split
# By default, train_test_split holds out 25% of the rows for validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
Model | Description |
---|---|
Decision Tree | Too many leaves overfit and too few underfit, so many settings must be tried; the parameter usually tuned is max_leaf_nodes |
Random Forest | Largely avoids the decision tree's overfitting problem, and performs well even with default parameters |
LightGBM | A newer tree model suited to large datasets, with a very large number of tunable parameters |
XGBoost | A gradient-boosted tree model with extensive tuning options; to be covered in a later post |
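As a rough illustration of the table's claim that a default random forest often beats a single unconstrained tree, a sketch on synthetic data (the dataset and all numbers here are illustrative, not from the Melbourne data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# A single tree grown to full depth tends to overfit the noise;
# the forest averages many trees and usually validates better
tree = DecisionTreeRegressor(random_state=0).fit(train_X, train_y)
forest = RandomForestRegressor(random_state=0).fit(train_X, train_y)

print("Tree MAE:  ", mean_absolute_error(val_y, tree.predict(val_X)))
print("Forest MAE:", mean_absolute_error(val_y, forest.predict(val_X)))
```

On most random seeds the forest's validation MAE comes out noticeably lower than the single tree's, with no tuning at all.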
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)
dt_model.fit(train_X, train_y)
from sklearn.ensemble import RandomForestRegressor
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
pip install lightgbm
The data must first be converted into an lgb.Dataset before LightGBM can process it.
Put the target column in the label argument.
param holds the hyperparameters: build them as a dictionary, then pass the dictionary to the model.
num_round is the maximum number of boosting rounds; early_stopping_rounds stops training when the validation metric has not improved for that many consecutive rounds.
import lightgbm as lgb
dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])
param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], early_stopping_rounds=10)
Validation | Description |
---|---|
Mean Absolute Error (MAE) | For numeric targets: take the absolute error of each prediction, sum them all, and divide by the number of rows, giving a single score for comparing models |
ROC AUC score | For binary or categorical targets: from the confusion matrix, plot the false positive rate on the x-axis against the true positive rate on the y-axis to get the ROC curve, then compute the AUC (Area Under the Curve of ROC) |
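Both metrics can be tried on tiny hand-made inputs before pointing them at real predictions (the numbers below are toy values chosen only to make the arithmetic easy to check by hand):

```python
from sklearn.metrics import mean_absolute_error, roc_auc_score

# MAE: mean of |y_true - y_pred| over all rows
y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 4.0]
# errors are 0.5, 0.0, 1.5 -> (0.5 + 0.0 + 1.5) / 3
print(mean_absolute_error(y_true, y_pred))  # 0.6666666666666666

# ROC AUC: scores how well the predicted scores rank a binary target
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(labels, scores))  # 0.75
```

Lower is better for MAE; for ROC AUC, 1.0 is a perfect ranking and 0.5 is no better than chance.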
from sklearn.metrics import mean_absolute_error
# Validate against the held-out data
# Get predicted prices for the validation set
val_predictions = dt_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
from sklearn import metrics
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")
Different parameter values give different results, so train with a range of values to reach the best one.
Write a function that automatically trains the model with a given parameter value:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
Pass different parameter values into the function, pick the one with the best score, and use it to train on the full dataset.
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
# Record each result in a dict, then pick the best value
candidate_max_leaf_nodes = [5, 50, 500, 5000]
mae_result = {}
for max_leaf in candidate_max_leaf_nodes:
    mae_result[max_leaf] = get_mae(max_leaf, train_X, val_X, train_y, val_y)
# min() with a key function returns the dict key with the smallest MAE
best_tree_size = min(mae_result, key=mae_result.get)
print(best_tree_size)
Max leaf nodes: 5 Mean Absolute Error: 347380
Max leaf nodes: 50 Mean Absolute Error: 258171
Max leaf nodes: 500 Mean Absolute Error: 243495
Max leaf nodes: 5000 Mean Absolute Error: 254983
500
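The `min(mae_result, key=mae_result.get)` idiom picks the dictionary *key* whose *value* is smallest; a minimal standalone illustration with the MAE results from above hard-coded:

```python
# MAE results keyed by max_leaf_nodes (values copied from the run above)
mae_result = {5: 347380, 50: 258171, 500: 243495, 5000: 254983}

# min() compares keys by whatever mae_result.get(key) returns,
# so it walks the values but hands back the winning key
best_tree_size = min(mae_result, key=mae_result.get)
print(best_tree_size)  # 500
```

An equivalent spelling is `min(mae_result, key=lambda k: mae_result[k])`; `dict.get` is just the shorter way to write the same key function.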
final_model = DecisionTreeRegressor(max_leaf_nodes=500)
final_model.fit(X, y)
Once the model is trained, feed the competition's test data through it to get predictions, then write the results to a CSV file for submission.
# Path to the test data file
test_data_path = './Dataset/test.csv'
# Read the test data with pandas
test_data = pd.read_csv(test_data_path)
# Build test_X from the columns used for prediction
test_X = test_data[features]
# Get predictions for the test data
test_preds = rf_model.predict(test_X)
# Arrange the results and write them out as CSV
output = pd.DataFrame({'Id': test_data.Id,
'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)